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ABSTRACT 

To further our knowledge of the complex physical process of galaxy formation, it 
is essential that we characterize the formation and evolution of large databases of 
galaxies. The spectral synthesis STARLIGHT code of Cid Fernandes et al. (2004) was 
designed for this purpose. Results of STARLIGHT are highly dependent on the choice 
of input basis of simple stellar population (SSP) spectra. Speed of the code, which uses 
random walks through the parameter space, scales as the square of the number of basis 
spectra, making it computationally necessary to choose a small number of SSPs that 
are coarsely sampled in age and metallicity. In this paper, we develop methods based 
on diffusion map (Lafon & Lee, 2006) that, for the first time, choose appropriate bases 
of prototype SSP spectra from a large set of SSP spectra designed to approximate the 
continuous grid of age and metallicity of SSPs of which galaxies are truly composed. 
We show that our techniques achieve better accuracy of physical parameter estimation 
for simulated galaxies. Specifically, we show that our methods significantly decrease the 
age- metallicity degeneracy that is common in galaxy population synthesis methods. 
We analyze a sample of 3046 galaxies in SDSS DR6 and compare the parameter 
estimates obtained from different basis choices. 

Key words: methods: data analysis - methods: statistical - methods: numerical - 
galaxies: evolution - galaxies: formation - galaxies: stellar content 



1 INTRODUCTION 

Determining the physical parameters of large samples of 
galaxies is crucial to constrain our knowledge of the com- 
plicated physical processes that govern the formation and 
evolution of these systems. Ultimately, we seek methods that 
quickly and effectively map observed data from ~ 10 6 galax- 
ies to sets of parameters that describe the star formation and 
chemical histories of these galaxies. Due to the complexity 
of the mechanisms of galaxy evolution, it is critical that we 
adopt a model that uses few assumptions about how galaxies 
evolve. Moreover, full utilization of the available theoretical 
stellar models, avoiding ad hoc simplifications, is critical to 
obtaining accurate parameter estimates. 

There has been a large amount of work dedicated 
to estimating the physical parameters of databases of 
galaxies using different types of data. For example , 
spectral indice s (IWorthev I Il994l: iKauffmann et ail 120031; 



Glazebrook et alj 120031; iPanter. Heavens fc Jimenez! 



2005 



2006 



Tremonti et all 120041 : 
200911. emission features 



2005 



Chen et al.l 
19911; 



Gallaz z i et ah 

1 Mas-Hesse fc Kunthl 
Leitherer et al.l Il995 | ). and full high-resolution, broad- 
band spectra ifReichardt. Jimenez fc Heavens! l200ll: 



20031: ICid Fernandes et al l " |2004 

Mathis. Chariot fc Brinshmannl 120061 : lOcvirk et all ^ 
Asari et al.ll2007h have all been utilized to estimate the star 
formation histories (SFHs) and metallicities of galaxies. Re- 
cently, due to the abundance of high-quality, homogeneous 
databases of g alaxy spectra such as the Sloan Dig ital Sky 
Survey (SDSS. lYork et alj|2000l ; Istrauss et al.ll2002h . which 
provides high-resolution spectra for hundreds of thousands 
of galaxies, the approach of full high-resolution spectral 
fitting is now possible to infer the properties of large 
populations of galaxies. Recent work has shown promise in 
the application of thes e m ethods to SDSS g alaxy spectra 
fsee lPanter et al]|2007l and I Asari et al.l 120071 for reviews of 
results achieved by two such fitting methods). However, 
large-scale analyses, while possible, are not necessarily 
accurate, and can be computationally challenging. There is 
much room for improvement in accuracy of SFH parameter 
estimation. It is critical that we understand the shortcom- 
ings of the current models and improve their accuracy. At 
the same time, to achieve our goal of obtaining accurate 
SFH estimates for a large sample of galaxies, we cannot 
sacrifice the computational efficiency of current methods. 



* E-mail: jwrichar@stat.cmu.edu 



A common technique in the literature, called empiri- 
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cal population synthesis, is to model galaxies as mixtures of 
simple stellar populations (SSPs) with known physical pa- 
rameters. Rece n t studies tha t have used this method are , 
Bical ll 19881), iPelatl d 19971 ) . ICid Fernandes et al l (|200lf ). 



e.g. 
and 



Moultaka et al 



( 20041 ) . Under this model, galaxy data 
are treated as linear combinations of the observed prop- 
erties of SSPs. Historically, SSP parameters and observ- 
ables were derived from observations of well-understood 
stellar systems. More recent studies have instead used 
model-produced SSPs, such as t hose from the evolutionar y 
population synthesis models of iBruzual fe Chariot! < |2003h ■ 
The STARLIGHT spectr al fitting code, introduced by 
Fernandes et al.l (|2004h , fits observe d spectra with linear 
combi natio ns of SSPs from the models of lBruzual fc Charlotl 
l|2003l) . In ICid Fernandes et ail (|2005l ). it was shown that 
SFH parameters of simulated galaxy spectra could be re- 
covered by STARLIGHT in the absence of noise. However, 
their simulated spectra were generated and fit with the same 
basis of 45 SSPs, rendering the results difficult to generalize 
to the expected performance on real galaxies, which can be 
composed of SSPs on infinitely fine grids of age and metal- 
licity. 

Results of the STARLIGHT fitting code are highly de- 
pendent on the choice of basis of SSP spectra. The database 
of SSPs available to us could theoretically be infinitely large, 
encompassing all combinations of age, metallicity, initial 
mass function, evolutionary prescription, etc. Including too 
many SSPs in our basis causes the code to be prohibitively 
slow and the solution to be degenerate due to the inclusion 
of SSPs with almost identical spectra. If, however, we simply 
hand-select a few SSPs on a coarse grid of age and metallic- 
ity with which to model our data, then we effectively ignore 
large subsets of our SSP parameter space, leading to sub- 
optimal estimates for fits to databases of galaxies. Using a 
hand-selected basis of SSPs may also lead to the inclusion of 
multiple SSPs whose spectra are essentially identical, which 
can lead to degeneracies. 

There is much room for improvement in the accuracy of 
SFH estimates by exploring numerical methods of selecting 
small bases of SSPs. In this paper, we propose a method 
of choosing a small set of prototype spectra from a large 
database of SSP spectra based on their observable quan- 
tities. Our method attempts to capture much of the vari- 
ability of a large set of SSPs in a few numerically chosen 
pr ototype SSPs. The meth od is based on the diffus ion map 
of ICoifman fc Lafonl (120061 ) and lLafon fc Led (|2006l ). a non- 
linear technique that seeks a simple, natural representation 
of data that are complex and high-dimensional, such as high- 
resolution astronomical spectra. 

We utilize the diffusion K-means method to numer- 
ically find a basis of prototype SSPs with which we can 
fit galaxies to accurately estimate their SFHs. In a future 
study, we will consider the problem of computational speed 
in methods such as STARLIGHT, which takes > 5 minutes 
to fit each SDSS galaxy spectrum for a basis of size 150, and 
show how to quickly extend the SFH estimates obtained by 
using the methods in this paper to a large database of galax- 
ies by taking advantage of the geometry of the manifold on 
which the data lie. 

The applicability of the methods introduced here ex- 
tend beyond the problem of SFH estimation using high- 
resolution spectra. Any astronomical task where observed 



data (i.e. spectra, photometric data, spectral indices, etc.) 
are fit as linear combinations of observable data from sim- 
pler systems can benefit from intelligent numerical selection 
of prototypes. 

The outline of the paper is as follows. In §2 we describe 
the STARLIGHT spectral synthesis code, discuss the draw- 
backs of SSP bases employed in the literature, and intro- 
duce the diffusion isT-means method of prototype selection. 
We directly compare the basis of prototypes selected by our 
method to those used in the literature and those found using 
other methods. In §3 we fit several sets of realistic simulated 
galaxy spectra using the STARLIGHT code with different 
bases to directly compare the ability of each basis to accu- 
rately estimate galaxy parameters. In §4 we fit a set of SDSS 
galaxy spectra and analyse physical parameter estimates ob- 
tained by different bases. We conclude with some remarks 
and discussion of future directions in 85. 



2 METHODS 

2.1 Spectral synthesis 

2.1.1 Model and STARLIGHT fitting algorithm 

We adopt the model o f galaxy spectra as a mixture of simple 
systems introduced in lCid Fernandes et al.l l|2004l ): 



M\ = M. 



(1) 



where M\ is the model flux in wavelength bin A. The com- 
ponent £j is the jth basis spectrum normalized at wave- 
length Ao. The basis of SSP spectra £ = {£i, £k} is cho- 
sen before performing the analysis. The scalar Xj £ [0, 1] 
is the proportion of flux contribution of the j'th component 
at Ao, where S,=i x i = lj the vector x with components 
Xj, (j — 1, K) is the population vector of a galaxy. These 
flux fractions can be converted to mass fractions, fii, using 
the light-to-mass ratios of each basis spectrum £j at Ao. To 
describe the SFH of a galaxy, we will use time-binned ver- 
sions of x and fi, denoted x c and fi c , respectively. Using 
binned population vectors lets us avoid degeneracies that 
can occur in the estimation of the original population vec- 
tors. 

Other components of the model in ([TJ are the scaling 
parameter M\ , the reddening term, r\ = \Q~ 0A ( A >-~ A ^a\ 
and the Gaussian convolution, G(t>», <t„). The reddening 
term describes distortion in the observed spectrum due 
to foreground dust, and is modeled by the extinction law 
of ICardelli. Clayton fc Mathisl (|l989l ) with R v = 3.1. The 
Gaussian convolution, G(f*,a*), accounts for movement of 
stars within the observed galaxy with respect to our line- 
of-sight, and is parametrized by a central velocity v* and 
dispersion a». 

To fit individual gala xies, we use the STARLIGH T fit- 
ting routine introduced bv lCid Fernandes et all (|2005l ). The 
code uses a Metropolis algorithm plus simulated annealing 
to find the minimum of 



X 2 (x,M Xo ,A v ,v«,a,) = -M x )w x Y 



(2) 
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where 0\ is the observed flux in the wavelength bin A, M\ is 
the model flux in {1} , and w\ is the inverse of the noise in the 
measurement Ox . The summation is over the N\ wavelength 
bins in the observed spectrum. The minimization of ((2)1 is 
performed over K + 4 parameters: xi, xk, M\ , Ay, v m , 
and a*, where K is the number of basis spectra in £. The 
speed of the algorithm scales as K 2 . For the analysis of, e.g. 
~ 10 6 galaxy spectra in the SDSS database, it is crucial to 
pick a basis with a smal l number of spectra. 



In, for examp le, Cid Fernandes et al.l l|2004l ) and 



ICid Fernandes et all (l2005l l the STARLIGHT model Q 
and fitting algorithm are tested using simulations. The sim- 
ulated spectra are generated from the model in (fT| using 
K = 30 SSPs fromlTremqntil ([20031 ) and K = 45 SSPs from 
iBruzual fc Chariot! (|20ol T respectively. These authors fit 
the simulations using the same basis of SSPs that was used 
to generate the simulations. From their analyses, they con- 
clude that in the absence of noise, the algorithm accurately 
recovers the input parameters. In the presence of noise, the 
population vector x is not recovered, and it is advised that 
a time-binned description of the SFH be adopted. Although 
their use of the same basis for both generating and fitting 
simulated spectra is an appropriate test that the algorithm 
works, it is not a fair assessment of the expected perfor- 
mance of the methods for a database of real galaxy spectra. 
In £12.1.21 we discuss our concerns with their simulation as- 
sessments. In H2.2I we present a new suite of methods that 
we test with several sets of simulated galaxy spectra. These 
simulations, which are described in £[3l are designed to be 
more representative of a large database of galaxy spectra, 
such as SDSS. 



2.1.2 Stellar population models 

In reality, the population of all observed galaxies is not con- 
strained to be mixtures of a small number of SSPs on a 
discrete grid of age and metallicity. A more physically accu- 
rate representation of galaxies is as mixtures of an infinitely 
large basis of SSPs on a continuous grid of age and metal- 
licity and, depending on the complexity of the underlying 
physics, a possibly infinite grid of prescriptions for initial 
mass function and evolutionary track. In our simulations in 
fJ3]we simulate galaxies from a large database of ~ 10 3 SSPs 
on a fine grid of age and metallicity to determine whether 
we can accurately describe their SFHs, metallicities, and 
kinematic parameters using an appropriately chosen, com- 
putationally tractable basis of K = 45 or 150 spectra. 

We adopt a SSP database containing 1278 spectra. 
This set of spectra is used to both choose an appropri- 
ate basis ( £|2.2[) and generate simulated spectra (*|3]). The 
database is meant to represent the large population of SSPs 
from which observed galaxies can be composed. We us e 
the spectra fro m the models of IBruzual fc Charlotl {2003), 
computed for a IChabrierl (I2003T) IMF 'Padova 1994' evo- 
lutionary tracks llAlongi et all Il993l; iBressan et all Il993l; 



iFagotto et al.lll994dld;lGirardi et alJll996h , and STELIB li- 
brary i|Le Borgne et al.l T2003 ) . Theoretically, we could also 
vary the IMF and evolutionary tracks across spectra and use 
the same methods developed in £12-21 but for simplicity de- 
cide to fix them in this study. The SSPs in our database are 
generated on a grid of 213 approximately evenly log-spaced 
time bins from to 18 Gyrs and 6 different metallicities: 



Z = 0.0001, 0.0004, 0.004, 0.008, 0.02 and 0.05, where Z 
is the fraction of the mass of a star composed of metals 
(Zq = 0.02). This grid is designed to approximate a con- 
tinuous variation across age and Z. The coarseness in the 
current Z grid is due to limitations in the current stellar 
population models. Again, a finer grid of age and Z can be 
implemented with the techniques that we develop in £|2.2I 



2.2 Choosing an appropriate basis of prototype 
spectra 

In previous studies using the STARLIGHT code, researchers 
hand-selected sets of SSP spectra t o use in the model fitting. 
According to lCid Fernandes et all (|2004l ). "the elements of 
the base should span the range of spectral properties ob- 
served in the sample galaxies and provide enough resolu- 
tion in age and metallicity to address the desired scientific 
questions." In this same paper, they claim that their ba- 
sis of K = 30 SSPs adequately recovers parameters of age 
and metallicity in their simulations. However, as discussed in 
£12.11 these simulations are unrealistic since they employ the 
same basis to both generate and fit the simulated spectra. 

In this section we present methods to numerically de- 
termine a basis £ of a small number of prototype spectra, 
from a large set of SSP spectra. The resultant basis of proto- 
type spectra will be designed to capture a large proportion 
of the variation of the set of SSPs, which is chosen to span 
the range of observed spectral properties of a data set of 
galaxies at a high resolution in age and metallicity, such as 
the database of 1278 SSPs described in H2.1.2I The numer- 
ically chosen basis of prototype spectra, £, should recover 
physical parameters better than any hand-chosen basis of 
the same size because hand-selected bases ignore subsets of 
the parameter space and may cause the solution to be de- 
generate by including multiple SSPs with nearly identical 
spectra. In £J3] we show that the bases computed from our 
numerical methods outperform hand-selected bases used in 
the literature in parameter estimation for simulated galax- 
ies. 



2.2.1 Diffusion K -means 

Diffusion map is a popular new technique for data 
parametrizatio n and dimensional it y redu ction that has been 
developed by ICoifman fc Lafonl (120061) and Lafon fc 
(2006), and was recently used by iRichards et all |2009l ') to 
analyse SDSS spectra and lFreeman et all (|2009T ) to estimate 
photometric redshifts. As shown in these latter works, diffu- 
sion map has the powerful ability to uncover simple struc- 
ture in complicated, high-dimensional data. These analyses 
demonstrate that this simple structure can be used to make 
precise statistical inferences about the data. 

In the problem at hand, we begin with a set B of high- 
resolution broad-band SSP spectra hi,i = 1,...,N, where 
N can be very large and each hi has flux measurements in 
P 10 3 wavelength bins. From this set, our goal is to derive 
a small basis of K prototype spectra, £ = —,(,k}, each 
of length p, that captures most of the variability in B. In 
what follows, we briefly review the basics of the diffusion 
map technique and then show how it can be used to find an 
appropriate basis £. 
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The main idea of the diffusion map is that it finds a 
parametrization of a data set that preserves the connectivity 
of data points in a way that is dependent only on the rela- 
tionship of each datum to points in a local neighborhood. 
By preserving the local interactions in a data set, diffusion 
map is able to learn, e. g. a natural parametri zation of the 
of the spiral in Fig. 1 of Richards et alj (|2009h where other 
methods, such as principal components analysis, fail. 

Starting with our set of SSP spectra, B, in units of 
Lq/A per initial Mq, we normalize the spectra at Ao to 
ensure that we are not confused by absolute scale differences 
between individual spectra. We retain the scaling factors 
7i i A for later use to convert back to the original units. Next, 
we select a discrepancy measure between SSPs, i.e. 



s(bi, bj 



\! 



(3) 



which is the Euclidean distance between two SSP spectra. 
The specific choice of s(bi, bj) is often not crucial to the 
diffusion map; any reasonable measure of discrepancy will 
return similar results. It is important to note here the flex- 
ibility of the diffusion map method. For each problem, we 
can define a different discrepancy measure based on the type 
of data available to us and the types of information we want 
to incorporate. For example, we could define our measure s 
by weighting more heavily those regions around absorption 
lines which are highly correlated with stellar age and metal- 
licity. Contrast this flexibility with the rigidity of principal 
components analysis, which is only based on the correlation 
structure of the data. 

From our discrepancy measure in (3), we construct a 
weighted graph on our data set B, where the nodes are the 
individual SSP spectra and the weights on the edges of the 
graph reflect the amount of similarity between pairwise spec- 
tra. We define the weights as 



tu(bi, bj) = exp - — 



(4) 



where e should be small enough such that w(bi,bj) ~ 
unless bi and bj are similar, but large enough so that the 
entire graph is connected (i.e. every SSP has at least one 
non-zero weight to another SSP). In practice, we choose e 
that produces a basis that achieves the best fits to galaxy 
spectra (see Sj3j. By normalizing the rows of the weight ma- 
trix, W, to sum to unity, we can define a Markov random 
walk on the graph where the probability of going from bi to 
bj in one step is pi(b;,bj) = w(bi, bj)/ J2 k w(bi, bfc). We 
store all of these one-step probabilities in a matrix P. 

By the theory of Markov chains, the t-step transition 
pr obabilities are define d by P f . In the previous analyses 
in lRichards et all (120091 ) and lFreeman et al.1 < |2009h . the au- 
thors fixed t. In these previous papers, the choice of t was 
unimportant because the task was linear regression on the 
diffusion coordinates, in which the multipliers Aj of the dif- 
fusion map were absorbed into the linear regression coeffi- 
cients Pj . In the present work, our task is to determine a set 
of K prototype spectra. The tool we use is if-means clus- 
tering in diffusion space, where results are dependent upon 
intra-point Euclidean distances of the diffusion map, which 
in turn are dependent on t. 

In this paper we introduce a diffusion operator that con- 



siders all scales t = 1,2,3, ... simultaneously. We define the 
multi-scale diffusion map as 

Ai . , As , , A 



* : bi 



A 



■Vi(bO 



1-A: 



A, 



-V>m(bj 



where ipj and Aj are the right eigenvectors and eigenvalues of 
P, respectively, in a bi-orthogonal decomposition. Euclidean 
distance in the m-dimensional space described by equation 
(JSJ approximates the multi-scale diffusion distance, a dis- 
tance measure that utilizes the geometry of the data set by 
simultaneously considering all possible paths between any 
two data points at all scales t, in the Markov random walk 
constructed above. An advantage to using the multi-scale 
diffusion map, besides elimination of the tuning parameter 
t, is that it gives a more robust description of the structure of 
the data by considering the propagation of local interactions 
of data points through all scales. 

We also have three other tuning parameters: e, m (di- 
mensionality of the diffusion map) , and K (number of proto- 
types) over which to optimize the fits to galaxy spectra. For 
simplicity, we fix m by choosing the dimensionality of the dif- 
fusion map to coincide with a 95% drop-off in the eigenvalue 
multipliers in the multi-scale diffusion map. This is reason- 
able because most of the information about the structure of 
the data will be in the few dimensions with large eigenvalue 
multipliers. Simulations show that this choice of m produces 
results comparable to optimizing over a large grid of m. For 
the parameter K, fits should improve as the basis gets larger. 
However, we want to keep K small both for computational 
reasons and to avoid the occurrence of nearly identical spec- 
tra in our basis, which can cause degeneracies in our fits. In 
practice, we choose K=45 or 150 to directly compare our 
results with the bases used b y ICid Fernandes et al.l (|2005h 
(CF05) and lAsari et~afl ( |2007h (Asa07). This leaves us with 
only one tuning parameter, e, over which to optimize the 
diffusion if-means basis in fits to galaxy spectra. 

Finally, we determine a set of K prototype spectra by 
performing if-means clustering in the m-dimensional dif- 
fusion map representation (J5J) of the SSP spectra, B. The 
A'-means algorithm is a standard machine learning method 
used to cluster N data points into K groups. It works by 
minimizing the total intra-cluster variance, 



V 



£ J2 [i*a*) -c(s?j)i 



(6) 



j=l i:b;GSj 



where Sj is the set of points in the jth cluster and c(Sj) 
is the m-dimension al geometric centro id of the set Sj. As 
described in §3.2 of lLafon fc Lee] (2006), for diffusion maps 



c(Sj 



E 



fo(b0*(bi) 



i-.biSSi ^bi€Sj 



<t>o (bi 



(7) 



where (j>o is the trivial left eigenvector of P. The K -means 
algorithm begins with an initial partition of the data and 
alternately computes cluster centroids and reallocates points 
to the cluster with the nearest centroid. The algorithm stops 
when no points are allocated differently in two consecutive 
iterations. The final centroids define the K prototypes to be 
used in our basis. 

In Fig. [1] we plot the mapping of 1278 SSP spectra into 
the m = 3 dimensional diffusion space described by (JSJ . The 
large black dots denote the K-means centroids for K = 45. 
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Diffusion K-means for SSP spectra, K=45, e=100 



Normalized SSP spectra, CF05 



J- 2 





Figure 1. Diffusion K-means for SSP spectra. Representation 
of 1278 SSP spectra in 3-dimensional diffusion space. Large 
black dots denote the K = 45 centroids. Individual SSPs are 
coloured by cluster membership. The SSPs reside on a simple, 
low-dimensional manifold which is captured by the K prototypes. 



Individual SSPs are coloured by cluster membership. The K 
prototypes capture the variability of the SSP spectra along 
a low-dimensional manifold in diffusion space. Note that the 
density of prototypes varies along different parts of the man- 
ifold. This is due both to the local complexity of the manifold 
and the sampling of the original base of SSPs. This latter ef- 
fect, that we tend to obtain more prototypes in areas where 
there are more SSPs in our basis, can easily be corrected for 
by using a weighted A'-means method, described in §2.2.3, 
that accounts for varying density of SSPs along the mani- 
fold. 

The last step of the diffusion A"-means algorithm is to 
determine the K protospectra and their properties. First, 
the normalization constant for the jth prototype spectrum 
is 



E 



<Mb 4 )7i( b 



E 



b;gS 



0o (b<) 



and the jth protospectrum is 
/>o(bj)bj 



= 7j E 

b;eS,- 



Er 



4>o(hi) 



(8) 



(9) 



Similarly, the age, metallicity, and "stellar- mass fraction" — 
the percent of the initial stellar mass still in stars — of pro- 
totype j are naturally defined as 

0o(bi)*(bi) 



*(&) = 



E 
E 

biSSj 

E 



E 



b;6Sj 



f>o (bj 



Eb i6 s, <Mbi 
0o(bi)/*(bt) 



Ev 



t>o(bi 



(10) 



(11) 



(12) 



where t(hi),Z(bi), and /*(b;) are the age, metallicity and 
stellar-mass fraction of the ith SSP, respectively. Equations 
(|10|) - (|12[) are natural definitions of the prototype parameters 




ttJOU 

X 



Normalized prototype SSP spectra, Diffusion K-means 




GOOO 
X 



Figure 2. Basis spectra for CF05 and Diffusion K-means, 
coloured by logt. All spectra are normalized to 1 at Ao = 4020 A. 
The diffusion ii"-means basis covers the spectral range in gradual 
increments while the CF05 basis includes many young spectra 
with similar spectral properties and sparsely covers the spectral 
range of older populations. 



in the sense that the fit of this basis to a galaxy would 
yield exactly the same parameter estimates as the equivalent 
mixture of the original set of SSPs, B. 

In Fig. Owe plot the 45 SSP spectra in CF05 and the 45 
prototype spectra found by diffusion if-means. All spectra 
are normalized to 1 at Ao=4020A and are coloured by log 
age. The diffusion A"-means prototypes spread themselves 
evenly over the range of spectral profiles, capturing a grad- 
ual trend from young to old spectra. On the other hand, 
the CF05 basis includes many similar spectra from younger 
populations and sparsely covers the range of spectral profiles 
of older populations. 

See Algorithm 1 for a summary of the 
diffusion if-means algorithm. Sample Mat- 
lab and R code is available on the web at 
http : //www. stat . emu. edu/~annlee/sof tw are .htm| 
The diffusionMap R package is available at 
http : //cr an . r-pro j ect . org/ 

Algorithm 1 Diffusion if-means 
1: 
2: 

3 
4 
5 
6 
7 
8 
9 

10 
11 
12 



Normalize SSP spectra, B, at Ao, fix e > 

Compute similarity matrix, S ([3]) and weight matrix W 

a 

Compute P: pi(bi, b,-) = w(bi, b,-)/Efc w(bi, h k ) 
Decompose P: pi(bs,b,) = J2k AfcV'fc(bi) ( ? !> fc(bj) 
Project B to the m-dimensional diffusion map, ([5]) 
A'-means: 

Set fe=l. Fix A". 



Randomly partition data into sets S{ 
while Sf _1) + SP for some ? 



k = k + 1 

Compute cluster centroids, c(S^) 
Partition the data so that 



,S 







(0) 



bi 6 argmin 



|*(b 4 



c(5f 



))ll 



13 



end while 

14: return K protospectra ((9J and corresponding parame- 
ters (11011121) 
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2.2.2 Other K -means algorithms 

In !|3l we compare the fits obtained by the diffusion A-means 
basis to those found by two other numerical methods: prin- 
cipal components (PC) A-means and standard A-means. 
Each of these two methods is similar to diffusion A-means, 
but both have critical drawbacks. 

PC A -means works similarly to diffusion A-means (Al- 
gorithm 1) except that A-means is performed on the pro- 
jection of the normalized SSP spectra into principal com- 
ponents (PC) space, not diffusion space. For an example 
of the app lication of principal components analysis to SSP 
spectra see lRonen. Aragon-Salamanca fc Lahavl |l999). The 
main drawback to PC A-means is its assumption that the 
SSP spectra lie on a linear subspace of the original p > 10 3 
dimensional space. If the SSPs actually lie on a non-linear 
manifold, then the A prototypes may poorly capture the in- 
trinsic variation of the original SSPs because the non-linear 
structure will have been inappropriately collapsed on to a 
linear space by the principal components projection. 

Standard A -means is also similar to Algorithm 1, except 
that A-means is performed in the original N\ ~ 10 3 dimen- 
sional space; i.e., no reduction in dimensionality is done be- 
fore running A-means. There are two obvious drawbacks to 
this procedure. First, the algorithm generally is slow because 
distance computations are cumbersome in high dimensions 
and A-means usually takes more iterations to converge. For 
comparison, the dimensionality of the spaces used by diffu- 
sion and PC A-means are each < 10, a factor of 100 smaller 
than p. Second, and more importantly, appropriate proto- 
type spectra are difficult to find by standard A-means be- 
cause Euclidean distances, used by A-means to define clus- 
ters, are only physically meaningful over short distances. 
Diffusion A-means avoids this problem by clustering in dif- 
fusion space, in which Euclidean distance approximates dif- 
fusion distance, a measure that has physical meaning on all 
scales. The result is that standard A-means inappropriately 
relates SSPs that are not physically similar. 

In Fig. [3] we plot logZ versus logt for A=150 proto- 
types in Asa07 and diffusion, standard, and PC A-means. 
Notably, diffusion A-means finds a much higher density of 
prototypes with high \ogZ or high logt, reflecting the com- 
plicated manner in which SSPs with those properties vary 
with respect to Z and t. At the other extreme, the Asa07 
prototypes reside on a regular grid, and thus include many 
prototype spectra that are essentially identical and also ex- 
clude prototypes that have unique spectral properties. The 
standard and PC A-means prototypes also estimate more 
high \ogZ and high logt prototypes than a regular grid. 



2.2.3 Incorporating other information 

The methods introduced above use only the observable prop- 
erties of SSPs to choose representative prototypes. It may 
be the case that we want to incorporate other a priori infor- 
mation that we have about the SSPs and their relationship 
with the galaxies we are fitting. For instance, we might know 
that a SSP with a particular age and metallicity is generally 
found in the types of galaxies we are trying to fit, and hence 
will want to include in our basis a prototype with character- 
istics closely matching those of this SSP. This information 
can easily be incorporated with the framework introduced 
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Figure 3. Plot of log Z versus logt for K=150 prototypes in 
AsaOl and K-means bases. There is a high density of diffusion 
if -means prototypes along a curve that encompasses large ages 
and metallicities. Due to the complicated manner in which those 
SSP spectra vary, many more prototypes are needed to span that 
spectral range. Contrast this to the regular grid of Asa07 proto- 
types. 



above by defining an a priori weight, uii for each SSP in 
our database, where higher weights signify more importance 
of the SSP. 

By modifying the definition of the diffusion map geo- 
metric centroid (0 to be 



w^o(bi)*(bi 



and altering the A-means algorithm to minimize 



Vu 



(13) 



(14) 



instead of V in (|6]), we choose a basis that reflects both the 
numerical observable properties of the SSPs and our a priori 
knowledge. 

This weighted A-means algorithm can be used to cor- 
rect for the tendency of the prototype SSPs to depend on 
the specific set of SSPs, B, chosen. If, for example, we de- 
fine Wi as the inverse of the local density of SSP i, then the 
prototype SSPs in Fig. [T] would more uniformly cover the 
manifold of SSPs in diffusion space, and not tend to more 
heavily sample denser regions of the SSP manifold. In this 
manner, the prototype spectra will only depend on the spec- 
tral properties, and not the particular sampling of the set of 
SSPs. 



3 SIMULATIONS 

We use simulations to assess the performance of the dif- 
fusion A-means method for selection of a basis of proto- 
type SSP spectra compared to bases derived by other A- 
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means methods and the bases of K = 45 used by CF05 
and K — 150 used by Asa07. We analyse the performance 
of these methods on physically-motivated simulated spectra 
from two different prescriptions. All simulated galaxy spec- 
tra are generated as mix tures of a fine grid of 12 78 SSP 
spectra from the model of lBruzual fc Charlotl l|2003l ). as de- 
scribed in i]2.1.2l We construct the simulated spectra with a 
fine grid of SSPs because we expect real galaxies to consist of 
sub-populations of stars on a continuous grid of age and Z. 
This approach differs signi ficantly from the sim u lation pre- 
scriptions of, for example. ICid Fernandes et al.l (|2004T ) and 
ICid Fernandes et a l, (2005) as discussed in H2.1.2I 



Like 



Cid Fernandes et al.l l|2005h . we simulate spectra 



over the range 3650-8000 A. We bin all spectra to one mea- 
surement every 1 A. For each simulated galaxy, we inde- 
pendently sample the reddening parameter, Av, and veloc- 
ity dispersion, <r», from a realistic distribution based on fits 
to a large sample of SDSS galaxies by the STARLIGHT 
code that was performed by the SEAGal (Semi Empirical 
Analysis of Galaxies) Group and is available on the web at 
www.starlight.ufsc.br. We add Gaussian noise following 
the normalized mean SDSS error spectrum with specified 
S/N at A = 4020 A. We mask out any spectra l bins a round 
emission lines, as done in lCid Fernandes et al.l |2005h . 

To choose the optimal value of e for the diffusion K- 
means basis, we fit simulations to each basis from a grid of e 
values and find the basis for which the fits achieve the small- 
est median \ 2 value, ([2]). Although we choose the optimal e 
by minimizing x 2 , our ultimate goal is accurate estimation 
of the SFH and physical parameters for each galaxy. How- 
ever, we find that bases that achieve small x 2 values tend to 
have small errors in their physical parameter estimates. 

To compare results across different bases, we compare 
the errors in physical parameter and SFH estimation. Our 
ultimate goal is to find the basis that most accurately de- 
scribes the star formation and metallicity history of each 
galaxy. In this study, our main parameters of interest are 



(logi)^ = £ayIog(t(&))> 



3=1 



log(Z) L = lo,. I j 



(15) 
(16) 



Av, and a,. To test the ability of different bases to estimate 
these parameters, we compute the mean square error (MSE) 
of the estimates across the simulated galaxies. To charac- 
terize the ability to recover SFH, we compute condensed 
population vectors by binning both x and fi into 5 loga- 
rithmically spaced time bins and then comparing these true 
condensed vectors, x c and /i c to the estimated condensed 
vectors, x c and Other techniques have proposed different 
schemes for compre ssing the SFH in formation retrieved by 
spectral fitting (see lAsari et al. I l2007l for an overview of the 
different compression methods employed in current spectral 
synthesis codes). Our choice of 5 age bins is an attempt to 
avoid the degeneracies that are likely if the population vec- 
tor were gridded into very fine age bins while still retaining 
a sufficient number of age bins to detect significant recent 
starburst events. Admittedly, the choice of 5 age bins is ad 
hoc. However, because we are only interested in comparing 
the ability of different bases of prototype spectra to recover 



SFH, any reasonable parametrization of the SFH will allow 
us to perform this analysis. 



3.1 Exponentially decreasing SFH 

In the first set of simulations, we choose a SFH with ex- 
ponentially decaying star formation rate (SFR): SFR oc 
exp(7i), where t is the age of the SS P in Gyr and we use SSPs 
up to a maximum of 14 Gyrs. Like lToieiro et all (|2007f ). we 
set 7 = 0.3Gyr _1 to have a SFH that is neither dominated 
by recent star formation nor too similar to a single old burst 
galaxy. In o ur simulations, we consider specific SFR (SSFR) 
as defined in lAsari et al.1 (|2007l ) by normalizing all simulated 
spectra to have the same mass. This is reasonable because 
in practice we are interested in how the SFH varies across 
many galaxies, and therefore wish to remove the absolute 
mass-scale dependence of each object's SFR. We have SSPs 
from 196 time bins where each bin's metallicity is randomly 
chosen from 6 different values. We test our methods on 50 
simulated spectra. Across different spectra the SFH is the 
same, while log(Z*)L, Av, a, and the random noise all vary. 
Each spectrum has S/N=10 at Aq. 



3.1.1 Results 

The following bases of prototype spectra are used to fit 
each simulated spectrum: 1) the CF05 basis, 2) diffusion 
Tf-means, 3) PC Tf-means, and 4) standard A"-means. Each 
basis contains K = 45 prototype spectra. For diffusion K- 
means, we perform fits over a grid of values of e to study 
how parameter estimates vary as a function of e. 

In Fig. ||]we show the error in spectral fits using diffu- 
sion Jf-means for different values of e. Also plotted is the 
error in the standard A"-means (sK) and PC 7f -means (PC) 
bases. All errors are plotted as the ratio of the \ 2 statistic 
to the x 2 from the CF05 basis fit. In the vast majority of 
the simulated spectra, all three of the techniques considered 
produce better spectral fits than the CF05 basis. There are 
no significant differences between spectral fit errors for the 
diffusion Tf-means bases as e varies. The minimum median 
X 2 for diffusion _K"-means is achieved at e = 750. 

We plot the MSE of our estimates for 
(log t„)L, log{Z,)L, Av, and a, in Fig. [5] For each pa- 
rameter, the CF05 basis predictions are the worst in terms 
of MSE. The performance of the diffusion K-means basis 
is not very dependent on the particular choice of e. This 
is a nice property, as any reasonable choice of e will give 
us good results, and we need not spend computing time 
searching for an optimal value of e. Diffusion if-means 
bases tend to find the best estimates of (log t»)L, Av, and 
cr*, and performs competitively with the other -means 
bases for log(Z*)L estimation. 

We compare the condensed SFH estimates across bases. 
Plots of the recovered x c and fi c vectors in Fig. [6] show that 
the CF05 basis tends to significantly underestimate the con- 
tributions of the oldest population while overestimating the 
contribution of the second oldest population. All three K- 
means bases find estimates that are closer to the true x c , 
but tend to underestimate the contribution of the oldest x c 
component. All three 7f -means bases find very accurate es- 
timates of jl c . The bias in x c estimation occurs for at least 
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Figure 4. Spectral fit errors to simulated spectra with exponen- 
tially decreasing SFH, 7 = 0.3. Errors are plotted relative to the 
error obtained by the CF05 basis (dashed line). For each value 
of e, diffusion i\-means outperforms the CF05 basis, as do both 
standard i\-means (sK) and PC K-means (PC). 
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Figure 5. MSE of four parameter estimates, for simulated spec- 
tra with exponentially- decaying SFH, 7 = 0.3. MSE is computed 
across 50 simulated spectra with S/N=10. Parameter estimates 
from the diffusion i\-means bases (solid dots) are plotted along 
with those from the CF05 basis (solid line), standard i-f-means 
basis (dashed line), and PC if-means basis (dotted line). The 
diffusion K-means basis generally outperforms the other bases 
considered; CF05 estimates poorly for all 4 parameters. 



two reasons. First, since the simulations are generated from 
SSPs on a fine age grid, it is intrinsically difficult to estimate 
all of the contributions of the subpopulations. Second, due 
to the exponential form of the SFR, older populations are 
the most important in the SFH, making it easy to underes- 
timate their importance and almost impossible to overesti- 
mate their contribution. 

For this simple prescription of galaxy evolution, bases 
constructed using K-means methods find significantly bet- 
ter estimates of physical parameters and SFH than the CF05 
basis. Next, we test our methods on a slightly more compli- 
cated set of simulations and also compare bases of size 45 
and 150 to characterize the advantage, if any, of using a 
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Figure 6. Estimated star formation histories for simulated spec- 
tra with exponentially- decaying SFH, 7 = 0.3. The estimated 
SFHs for 4 different bases are plotted as dots with dashed lines 
to denote the median estimated contribution of galaxy light (left 
column) and galaxy mass (right column) formed in each time bin 
for fits to 50 simulations. Error bars denote the 90% prediction 
intervals for the fits. The true, underlying model, SFR oc exp(Tt), 
is plotted as a solid line. The CF05 basis performs much worse 
than the K-means bases, confusing old populations for younger 



larger basis and whether this advantage is worth the extra 
expenditure in computation time. 



3.2 Exponentially decreasing SFH with random 
starbursts 

In our second set of s imulations, we use a prescription similar 
to lChen et ail (2009) to create a set of more realistic spectra 
on which to test our methods. The added SFH complexities 
over the simulations in i)3.1lare: 



(i) The time tf orm when a galaxy begins star formation 
is distributed uniformly between and 5.7 Gyr after the Big 
Bang, where the Universe is assumed to be 13.7 Gyr old. 

(ii) We again have SFR oc exp(7t), but now allow 7 to 
vary between galaxies. For each galaxy we draw 7 from a 
uniform distribution between 0.7 and 1 Gyr - . 

(iii) We allow for starbursts with equal probability at all 
times. The probability a starburst begins at time t is nor- 
malized so that the probability of no starbursts in the life 
of the galaxy is 33%. The length of each burst is distributed 
uniformly between 0.03 and 0.3 Gyr and the fraction of total 
stellar mass formed in the burst in the past 0.5 Gyr is dis- 
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Figure 7. MSE of four parameter estimates for simulated spec- 
tra with exponentially decreasing SFH and random starbursts. 
Plotted for each basis is the MSE plus standard error across 500 
simulations. The A"-means bases outperform bases from the lit- 
erature for both choices of K. Diffusion i^-means tends to find 
the best parameter estimates, showing no significant change in 
accuracy between K=45 and 150. 



tributed logarithmically between and 0.1. The SFR of each 
starburst is constant throughout the length of the burst. 

As above, each galaxy spectrum is generated as a mixture 
of SSPs of up to 196 time bins, with a randomly drawn 
metallicity in each bin. Allowing for random starbursts over 
the exponential SFH increases the importance of younger 
populations in the simulated galaxies. We test our methods 
on 500 simulated spectra, 250 with S/N=10 and 250 with 
S/N=20 at An. Across each spectrum, SFH, log(Z)_L, Ay, a, 
and the random noise all vary. We fit bases of size 45 and 
150. 



3.2.1 Results 

For this set of simulations, we only look at results for the 
diffusion Tf-means e that achieves the lowest \ 2 fit error 
for each K; the optimal values are e=250 for the K — 45 
basis and e=500 for the K = 150 basis. As before, results 
are not very sensitive to the specific choice of e. Note that 
the e=250 basis of size 45 also achieved good fits to the 
simpler simulations in i|3.1l In Fig. [7] we plot the MSE of 
the physical parameter estimates for each of the 4 bases 
for K=45 and 150. The diffusion Tf-means bases perform 
significantly better than all other bases for mean age and 
Ay estimation and achieve similar results as the other K- 
means bases for \og{Z*)i, and cr» estimation. The if-means 
bases heavily outperform the hand-chosen bases (CF05 & 
Asa07) for each parameter. The same trends are seen for 
error in estimation of time-binned star formation histories 
(Fig. [8}. The error estimates for all bases, split into S/N=10 
and S/N=20 galaxy spectra, are in Table Q] 

An interesting trend appears in Figs. 17181 For the hand- 
chosen bases, we see considerable reduction in the estima- 
tion error of every parameter in going from the K=45 ba- 
sis of CF05 to the A"=150 basis of Asa07. However, in the 
Tf-means bases, and particularly for the diffusion A"-means 
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Figure 8. Error in SFH estimation for simulated spectra with 
exponentially decreasing SFH and random starbursts. Average 
discrepancies of the estimated condensed light fraction (x c ) and 
condensed mass fraction (/t c ) vectors to the truth, averaged across 
500 simulations. Diffusion /("-means generally produces more ac- 
curate SFH estimates than the other bases considered. All K- 
means bases find much more accurate estimates than the litera- 
ture bases. 



basis, there is no substantial lowering of the MSE of any 
parameter in increasing from _K=45 to K—150. This tells us 
that even with as few as 45 prototypes, diffusion A"-means 
provides us a basis that can successfully fit these particular 
simulations. Moving to 7^=150 prototypes allows us no sig- 
nificant gains in parameter estimation while costing us ~10 
times the computational time. Note, however, that although 
a basis of size 45 may be sufficient for this set of simulations, 
a set of real galaxy spectra may be more complicated and 
require a larger basis of prototypes to achieve good fits. 

A common difficulty in population synthesis studies 
is the so-called age-metallicity degeneracy, where young, 
metal-rich SSPs are confused with old, metal-poor SSPs. 
Using a regular grid of SSP prototypes in population synthe- 
sis can exacerbate this problem because several SSPs with 
nearly identical spectra (but different ages and metallicities) 
will likely be included, inducing degeneracy in the spectral 
model {l}, an d causing the estimated SFH parameters for 
each galaxy to be highly variable. A diffusion if-means basis, 
on the other hand, picks SSP prototypes whose spectra more 
evenly span the range of spectral characteristics, minimizing 
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Table 1. Summary of parameter error estimates for simulated spectra with exponentially decreasing SFH and random starbursts. 
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Figure 9. Age-metallicity degeneracy in parameter estimation 
for simulated spectra with exponentially decreasing SFH and ran- 
dom starbursts. Diffusion if -means bases show significantly less 
degeneracy in parameter estimates than the literature bases, 
shown by a smaller slope in the linear trend. Diffusion if -means 
bases also produce less biased estimates of both parameters. 



the inclusion of very similar spectra and thus decreasing the 
degeneracy effects in the minimization of ([2]). In Fig. [9] we 
see that the age-Z degeneracy is much smaller in diffusion 
K-me&ns bases, quantified by the slope of the linear trend 
in A\og(Z,) l versus A(log£„)i,. Some degeneracy in age- 
metallicity still exists for the diffusion K-me&ns bases, but 
the magnitude of the slope of the trend is half. 

One may note that both log{Z*)i and (logt*)_t are sig- 
nificantly biased for all 4 bases in Fig. [9] This probably 
occurs due to the simulation prescription, which includes 
mostly old SSPs from a large, fine grid of age and Z. One 
may be inclined to use these estimated biases to correct the 
estimates of real galaxy spectra. However, the amount of 
bias is dependent on the particular prescription of SFH in 
the simulations. Since there is no general model that de- 
scribes galaxy SFH for a wide range of galaxies, there is no 
straightforward way to estimate the bias, if any, that will 
occur. 



4 APPLICATION TO SDSS SPECTRA 

In this section, we apply the fitting techniques to a sample of 
SDSS spectra to compare the results using different bases. 
We compare the physical parameters estimated from each 
of three different bases: Asa07, diffusion map with K=45, 
and diffusion map with K—150. We also compare parameter 
estimates to those estimated with the MPA/JHU data base. 
Our goal in this section is to study how the choice of basis 
affects the final parameter estimates for a large set of real 
galaxy spectra. 



4.1 Data preparation 

Our data sample consists of spectra from ten arbitrarily cho- 
sen spectr oscopic plates of SDSS DR6 (0266 —0274 inclusive, 
and 0286; lAdelman-McCarthv et all 120081 ) that are classi- 
fied as galaxy spectra. The goal is to test our methods on a 
sample that is large enough that we include galaxies with a 
broad range of star formation and metallicity histories but 
not so large that analysis is computationally cumbersome. 
In a future work, we will present results for the more than 
700k galaxy spectra in SDSS DR7. 

We remove spectra from this sample by applying three 
cuts. The first, motivated by aperture considerations, is to 
analyse only spectra whose SDSS redshift estimates ^ 0.05. 
Second, we determine the proportion of the 3850 wavelength 
bins that are flagged as bad. If this proportion exceeds 10%, 
we remove the spectrum from the sample; if not, we retain 
the spectrum for further analysis. Third, we select galaxy 
spectra with a SDSS redshift confidence level ^ 0.35. These 
cuts leave us with a sample of 3046 galaxy spectra. 

We bring all spectra to rest frame using the SDSS red- 
shift estimates. In the fitting routines, we exclude any spec- 
tral bins with SDSS flag ^ 2. This eliminates any sky residu- 
als, bad pixels, and artefacts from the data reduction. We ad- 
ditionally mask the pixels in the regions within and around 
emission lines as in lCid Fernandes et al.l l|2005h . Results of 
the STARLIGHT spectral fits give a mean x 2 /N\ of 1.03 for 
each of the three bases used, close to a value of 1 that would 
be expected if wavelength bin fluxes were independently dis- 
tributed Gaussian random variables. Note that this value 
varies considerably from the mean v alue of 0.78 found in 
the an alysis of 50362 SDSS spectra bvlCid" Fernandes et al.l 
(|2005l V This difference can be attributed to our careful han- 
dling of spectral bin error estimates in our rebinning of the 
spectra to a dispersion of 1 measurement per A. 
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4.2 Parameter estimates 

The STARLIGHT fitting routine is used to fit all 3046 SDSS 
spectra with each of the three bases. Each basis yields pa- 
rameter estimates that are substantially different. In Fig. 
[10] we plot each galaxy's \og{Z*)L versus (logt*}i estimate 
for each basis. There are several notable differences in the 
log{Z*}L — (logi»)i, plane for each basis. First, both diffu- 
sion map bases produce estimates that are tightly centered 
around an increasing linear trend while the Asa07 estimates 
are diffusely spread around such a trend. Moreover, the di- 
rection of discrepancy in the Asa07 estimates from the lin- 
ear trend corresponds exactly with the direction of degener- 
acy between old, metal-poor and young, metal-rich galaxies. 
This suggests that the observed variability along this direc- 
tion is not due to the physics of these galaxies, but rather is 
caused by confusion stemming from the choice of basis. 

Secondly, the slopes of the estimated trends produced 
by the diffusion map bases (0.146 for K =45 and 0.180 for 
K=150) are considerably smaller than the slope of the trend 
in the Asa07 estimates (0.250). These estimates have impli- 
cations for the physical models that govern galaxy evolution. 
Finally, the K=45 diffusion map basis estimates no young, 
metal-poor galaxies, showing a sharp decline from densely- 
populated regions of the parameter space to vacant regions. 
This suggests that this small number of prototypes is not 
sufficient to cover the parameter space; particularly, young, 
metal-poor SSPs have been neglected in the basis. 

Next we compare kinematic parameter estimation. Fig. 
[TTJshows that the Asa07 basis tends to consistently estimate 
slightly larger values of Av than either diffusion map basis. 
On average, Asa07 estimates of Ay are 0.045 larger than 
diffusion map K=45 estimates and 0.024 larger than diffu- 
sion map K=150 estimates. Estimates of a, are consistent 
between all three bases. 

Finally, we c ompare age-binned SFH es timates found 
by each basis. Like lCid Fernandes et all l|2005l ). we condense 
the SFH to three parameters: old, xq [tj > 10 9 yr), interme- 



Age-Z, Asa07 



diate, jcj (10 s 



10 9 yr) 



and young, Xy (tj < 10 yr). 



Plots of diffusion map estimates of each component versus 
Asa07 estimates are in Fig. 1121 The estimated SFHs show 
a high degree of correspondence between the bases. On av- 
erage, each diffusion map basis obtains slightly higher es- 
timates of xo and slightly lower estimates of both x^ and 

Xy. 



4.3 Comparisons to the MPA/JHU data base 

We compare the SDSS parameters estimated with the three 
different choices of STARLIGHT basis with parameters ob- 
tained using measured emission line and continua values 
from the MPA/JHU data base. The MPA/JHU collabo- 
ration has made their SDSS spectral m easurements pub- 
licly a vailable on the MPA SPS S websit (fl (iKauffmann et all 



2003; iBrinchmann et all |2004| ; iTremonti et all 120041 ). Al 



though we do not expect any of the basis choices to pro- 
duce estimates that perfectly correspond to the independent 
MPA/JHU estimates, we analyse the relationships between 
the estimates as a further test of the accuracy of each basis. 
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Figure 10. Estimated ages and metallicities for 3046 SDSS 
galaxies. Diffusion i^-means estimates tend to lie tightly around 
a linear trend in this plane while Asa07 estimates show large 
dispersion in the perpendicular direction, which corresponds to 
the direction of age-Z degeneracy in galaxy population synthesis 
studies. 
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Figure 11. Estimated Ay and cr* parameters for 3046 SDSS 
galaxies. The Asa07 Ay estimates tend to be larger than the 
corresponding diffusion map estimates. The a, estimates agree 
for all 3 bases. 
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Figure 12. Age-binned population vector for 3046 SDSS galax- 
ies. Plotted are the Asa07 estimates of the 3-dimcnsional age- 
binned population vector versus those of two diffusion i^-means 
bases. Diffusion map tends to find slightly smaller contributions 
of the oldest populations. 



4.3.1 Ay and A^ eb 

We use the Balmer emission-line ratio Ha/H(3 to estimate 
the nebular extin ction, A^f° , of each galaxy. Following 
IOsterbrockl(|l989l l. we assume case B recombination, a den- 
sity of 100 cm -3 and a temperature of 10 4 K to compute 
A^ h according to ifTT]) - (fTH]l : 

A 1 ^ = cx 3.1/1.47 (17) 
c = -2.70(2.86 - log(Ha/H/3) oha ) (18) 

where we have used the same constants used in lChen et al.l 
(|2009h . 

Fig. H3lshows plots of the values of A^ h computed from 
the MPA/JHU data base versus the values of the stellar ex- 
tinction, Ay, computed by each of the three different bases 
in the STARLIGHT code. We have only plotted the 1755 
SDSS galaxies in our sample whose Ha and Hf3 emission line 
measurements both have signal-to-noise ratio ^ 3. We find 
that Ay* and Ay are loosely linearly related for each basis. 
The estimated Spearman correlation is 0.51 for the Asa07 
basis and 0.52 for each diffusion map basis. The slope of the 
linear trend is slightly higher for the Asa07 basis (1.21) than 
for the diffusion map bases (1.16 for each basis). All three 
bases obtain stellar extinction estimates that are consistent 
with the nebular extinction values A^f° , and there are no 
significant differences between the strength of the relation- 
ships between the estimates of each basis and the nebular 
extinction estimates. 

4.3.2 (log U ) l and D n (4000) & H5 A 

As in IChen et alj (|2009l ). we compare our spectral 
synthesis estimates to two a ge indicators from the 
MPA /JHU database: £^(4000) jBalogh et all 1 19991 ) and 
HS A (|Worthev fc Ottaviani|[l997l ). The value A, (4000) de- 
scribes the break in the 4000 A region of the galaxy spec- 
trum; higher values generally correspond to increasing ages 
of the galaxy's stellar subpopulations. Larger strength of the 
H5 absorption line, HSa, indicates higher star formation 
rates in the past 0.1-1 Gyr of a galaxy's lifetime. Hence, we 
expect our estimates of (logt*)i to correlate positively with 
D,i (4000) and negatively with HSa- 

In Fig. 1141 we see that the (logt»)z, estimates show 
the expected relationships with both D n (4000) and HSa for 
each of the three bases. Correlations are strongly positive for 
the D n (4000) — (log t»}i trend and strongly negative for the 
HSa — (logt*)i trend. We have plotted the 3031 galaxies in 
our sample that have reliable D n (4000) and HSa measure- 
ments in the MPA/JHU catalogue. There is no significant 
difference in each relation between the bases. 

4-3.3 {Z,) M and Zf£ 

Lastly, we compare our STARLIGHT estimates of {Z*)m 
with a measure of nebular metallicity based on the ratio 
of emission lines [OIII]A5007 and [NIIJA6583, provided by 
MPA/JHU. Particularly, the quantity Z na h measures the 
oxygen abundance of a galaxy, and is defined as 

Z nch = -3.45 - .25 x O3N2 (19) 



where Q 3 N 2 = log([OIII]A5007/[NII]A6583) 

(|Pettini fc Pagell 120041 ). All emission line fluxes are first 
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Figure 13. MPA/JHU versus population synthesis reddening 
estimates for SDSS galaxies. All three bases considered find 
estimates that are generally consistent with the independent 
MPA/JHU emission line estimates. 
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Figure 14. MPA/JHU versus population synthesis age indica- 
tors for SDSS galaxies. All three bases considered retrieve esti- 
mates that are consistent with two different MPA/JHU indicators 
of age of stellar population. 



corrected for extinction using the Av estimates found in 
STARLIGHT fits. We would expect (Z») M to be positively 
related to Z ne ^, as the former measures stellar metallicity 
and the latter measures nebular oxygen abundance. 

In Fig. [15] we see that all three bases find {Z*)m es- 
timates that correlate positively with Z no b, and that the 
Z=150 diffusion map basis estimates correlate more strongly 
with Z ne b, with Spearman correlation coefficient of 0.21, 
compared to the Asa07 basis estimates, which have a cor- 
relation of 0.12 with the Z ne ^ estimates. Furthermore, the 
best-fit linear trend of the Z ne b versus diffusion map K=150 
estimates has a significantly higher slope, 0.128 than the lin- 
ear trend of ^ nc b vs. Asa07 basis, 0.056. This result suggests 
that the diffusion map basis obtains more reliable metallicity 
estimates than the Asa07 basis. 



5 DISCUSSION 

We have presented a method to choose an appropriate basis 
of SSP prototypes to be used in population synthesis models 
of observed galaxy spectra. Pa rticularly, we have used the 
STARLIGHT fitting routines of lCid Fernandes et all (|2005T l 
to study the ability of both literature and diffusion if-means 
bases to model both simulated and SDSS galaxy spectra. On 
simulated galaxies, our diffusion if-means basis outperforms 
hand-chosen literature bases as well as bases constructed 
from standard A'-means or PC JT-means. For real SDSS 
galaxy spectra, we show that the diffusion Jf-means ba- 
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Figure 15. MPA/JHU versus population synthesis metallicity 
estimates for SDSS galaxies. MPA/JHU estimates of nebular oxy- 
gen abundance correlate more strongly with diffusion i^-means 
estimates of stellar metallicity than the Asa07 estimates. 



sis finds parameter estimates that diffe r substantial l y from 
the estimates retrieved by the basis of lAsari et all (|2007h . 
The diffusion Tf-means estimates seem to alleviate the age- 
metallicity degeneracy that has been detrimental to previous 
population synthesis studies of large data bases of galaxies. 

At this point, more discussion of the age-metallicity de- 
generacy is warranted. Degeneracy arises because several 
SSPs have very similar spectral characteristics, but different 
ages and metallicities. When degenerate SSPs are included, 
estimates can be highly variable. This is shown in the first 
panel in Fig. [inland in the STARLIGHT code itself, which 
in the presence of degeneracy tends to converge to different 
solutions due to the random walk taken through the pa- 
rameter space by the Metropolis algorithm. The diffusion 
if-means algorithm, on the other hand, includes only the 
mean spectrum and mean parameters of each set of degen- 
erate spectra. This choice minimizes the statistical risk of 
the final solution by decreasing the variance and bias of the 
estimates (our measure of statistical risk is the MSE, which 
is the sum of the variance and squared bias of an estimate) , 
as seen in simulations (Figs. 5-9). Our SFH estimates for 
SDSS spectra are significantly less variable in the direction 
of the age-Z degeneracy, as seen in Fig. 1101 while the relative 
amount of bias is impossible to quantify without knowledge 
of the true SFH of each galaxy. 

In the face of degeneracies and with no other outside 
information available, our proposed method (of picking the 
mean of several degenerate solutions) is the best one can 
do in terms of minimizing the statistical risk of the esti- 
mates. However, with more information about what features 
in the SSP spectra are most strongly related to their ages 
and metallicities, we could potentially improve our estimates 
by incorporating such information in the construction of the 
similarity measure s(bi,b,-) in (3). The weight matrix pro- 
duced by an expert-produced similarity measure may pro- 
duce a more complete coverage of the parameter space in 
our SSP prototypes and help to remove degeneracies in the 
parameter estimates. Further study is required in this direc- 
tion. 

The ultimate goal of this project is to be able to quickly 
and accurately estimate physical property and SFH param- 
eters for a large data base of galaxies. As an extension of 
this work, we plan to study how to quickly and effectively 
extend estimates found for a small set of galaxies using the 
computationally intensive STARLIGHT routine to a large 
database of galaxies. To achieve this goal, we must exploit 
the low-dime nsional structur e of th ese data. This approach 
was taken by iRichards et~aH (|2009T ) in galaxy redshift esti- 
mation. These methods must be further developed and re- 
fined to reliably estimate SFH parameters. 

The methods presented in this paper are applicable in 
a wide range of problems in astrophysical data analysis. 
Diffusion map is a powerful tool to uncover simple struc- 
ture in complicated, high-dimensional data, whether they be 
spectra, photometric colours, two-dimensional images, data 
cubes, etc. Specifically, diffusion map is both more flexible 
and more widely applicable than PCA, which relies on the 
linear correlation structure of the data and assumes that 
the data reside on a low-dimensional, linear manifold. The 
diffusion K-me&ns method can be used in a variety of prob- 
lems often encountered in astrophysics where a large set of 



Accurate SFH estimation in galaxies 15 



observed or model data are described with a few prototype 
examples. 

Future work will attempt to more concretely define cri- 
teria for the performance of algorithms that define K proto- 
types from N data points and to theoretically justify the use 
of diffusion A'-means. We plan to explore more deeply the 
relationship between the if -means methods introduced here 
and to compare the methods introduced in this paper to 
other methods of basis selection, such as Basis Pursuit and 
Matching Pursuit. We also will investigate how meaningful 
standard errors can be attached to our parameter estimates 
using any choice of basis. 
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